1. Predicting the likelihood of a diabetic patient being readmitted (based on 70,000 medical records from 130 US hospitals, 1999-2008)
2. Chemical medication vs. biological medication
3. Who has the higher probability of having a higher HbA1c
dData%>%
knitr::kable(caption = "Data variables description", digits = 3)%>%
kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE,position = "left")
## Warning in kableExtra::kable_styling(., bootstrap_options = "striped",
## full_width = FALSE, : Please specify format in kable. kableExtra can customize
## either HTML or LaTeX outputs. See https://haozhu233.github.io/kableExtra/ for
## details.
| Variable | description |
|---|---|
| encounter_id | Id given during visit of patient |
| patient_nbr | Patient number |
| race | Race of the patient |
| Gender | Gender of patient |
| age | age of patient |
| weight | weight of patient |
| admission_type_id | 1=Emergency; 2=Urgent; 3=Elective; 4=Newborn; 5=Not Available; 6=NULL; 7=Trauma Center; 8=Not Mapped |
| discharge_disposition_id | 1=Discharged to home; 2=Discharged/transferred to another short term hospital; 3=Discharged/transferred to SNF; 4=Discharged/transferred to ICF; 5=Discharged/transferred to another type of inpatient care institution; 6=Discharged/transferred to home with home health service; 7=Left AMA; 8=Discharged/transferred to home under care of Home IV provider; 9=Admitted as an inpatient to this hospital; 10=Neonate discharged to another hospital for neonatal aftercare; 11=Expired; 12=Still patient or expected to return for outpatient services; 13=Hospice / home; 14=Hospice / medical facility; 15=Discharged/transferred within this institution to Medicare approved swing bed; 16=Discharged/transferred/referred another institution for outpatient services; 17=Discharged/transferred/referred to this institution for outpatient services; 18=NULL; 19=Expired at home (Medicaid only, hospice); 20=Expired in a medical facility (Medicaid only, hospice); 21=Expired, place unknown (Medicaid only, hospice); 22=Discharged/transferred to another rehab fac including rehab units of a hospital; 23=Discharged/transferred to a long term care hospital; 24=Discharged/transferred to a nursing facility certified under Medicaid but not certified under Medicare; 25=Not Mapped; 26=Unknown/Invalid; 27=Discharged/transferred to a federal health care facility; 28=Discharged/transferred/referred to a psychiatric hospital of psychiatric distinct part unit of a hospital; 29=Discharged/transferred to a Critical Access Hospital (CAH); 30=Discharged/transferred to another Type of Health Care Institution not Defined Elsewhere |
| admission_source_id | 1=Physician Referral; 2=Clinic Referral; 3=HMO Referral; 4=Transfer from a hospital; 5=Transfer from a Skilled Nursing Facility (SNF); 6=Transfer from another health care facility; 7=Emergency Room; 8=Court/Law Enforcement; 9=Not Available; 10=Transfer from critical access hospital; 11=Normal Delivery; 12=Premature Delivery; 13=Sick Baby; 14=Extramural Birth; 15=Not Available; 17=NULL; 18=Transfer From Another Home Health Agency; 19=Readmission to Same Home Health Agency; 20=Not Mapped; 21=Unknown/Invalid; 22=Transfer from hospital inpt/same fac reslt in a sep claim; 23=Born inside this hospital; 24=Born outside this hospital; 25=Transfer from Ambulatory Surgery Center; 26=Transfer from Hospice |
| time_in_hospital | Time spent in hospital in days (1 to 14 in this data) |
| payer_code | Payer payment code |
| medical_specialty | Area/field of medicine |
| num_lab_procedures | Number of lab tests performed during the encounter |
| num_procedures | Number of procedures (other than lab tests) performed during the encounter |
| num_medications | Number of medications |
| number_emergency | Number of emergency visits in the year before the encounter |
| number_outpatient | Number of outpatient visits in the year before the encounter |
| number_inpatient | Number of inpatient visits in the year before the encounter |
| diag_1 | Diagnosis 1 (ICD-9 code) |
| diag_2 | Diagnosis 2 (ICD-9 code) |
| diag_3 | Diagnosis 3 (ICD-9 code) |
| number_diagnoses | Number of diagnoses recorded |
| max_glu_serum | Glucose serum test result |
| A1CResult | HbA1c (hemoglobin A1c) test result, which reflects the blood sugar level |
| metformin | Medication feature (No, Steady, Up, or Down) |
| repaglinide | Medication feature (No, Steady, Up, or Down) |
| nateglinide | Medication feature (No, Steady, Up, or Down) |
| chlorpropamide | Medication feature (No, Steady, Up, or Down) |
| glimepiride | Medication feature (No, Steady, Up, or Down) |
| acetohexamide | Medication feature (No, Steady, Up, or Down) |
| glipizide | Medication feature (No, Steady, Up, or Down) |
| glyburide | Medication feature (No, Steady, Up, or Down) |
| tolbutamide | Medication feature (No, Steady, Up, or Down) |
| pioglitazone | Medication feature (No, Steady, Up, or Down) |
| rosiglitazone | Medication feature (No, Steady, Up, or Down) |
| acarbose | Medication feature (No, Steady, Up, or Down) |
| miglitol | Medication feature (No, Steady, Up, or Down) |
| troglitazone | Medication feature (No, Steady, Up, or Down) |
| tolazamide | Medication feature (No, Steady, Up, or Down) |
| examide | Medication feature (No, Steady, Up, or Down) |
| citoglipton | Medication feature (No, Steady, Up, or Down) |
| insulin | Medication feature (No, Steady, Up, or Down) |
| glyburide-metformin | Medication feature (No, Steady, Up, or Down) |
| glipizide-metformin | Medication feature (No, Steady, Up, or Down) |
| glimepiride-pioglitazone | Medication feature (No, Steady, Up, or Down) |
| metformin-rosiglitazone | Medication feature (No, Steady, Up, or Down) |
| metformin-pioglitazone | Medication feature (No, Steady, Up, or Down) |
| change | Change of medication |
| diabetesMed | Diabetes medications |
| readmitted | Readmission to hospital (<30 days, >30 days, or NO) |
skim(mData) # Computing statistics by data type
| Name | mData |
| Number of rows | 101766 |
| Number of columns | 50 |
| _______________________ | |
| Column type frequency: | |
| factor | 37 |
| numeric | 13 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| race | 0 | 1 | FALSE | 6 | Cau: 76099, Afr: 19210, ?: 2273, His: 2037 |
| gender | 0 | 1 | FALSE | 3 | Fem: 54708, Mal: 47055, Unk: 3 |
| age | 0 | 1 | FALSE | 10 | [70: 26068, [60: 22483, [50: 17256, [80: 17197 |
| weight | 0 | 1 | FALSE | 10 | ?: 98569, [75: 1336, [50: 897, [10: 625 |
| payer_code | 0 | 1 | FALSE | 18 | ?: 40256, MC: 32439, HM: 6274, SP: 5007 |
| medical_specialty | 0 | 1 | FALSE | 73 | ?: 49949, Int: 14635, Eme: 7565, Fam: 7440 |
| diag_1 | 0 | 1 | FALSE | 717 | 428: 6862, 414: 6581, 786: 4016, 410: 3614 |
| diag_2 | 0 | 1 | FALSE | 749 | 276: 6752, 428: 6662, 250: 6071, 427: 5036 |
| diag_3 | 0 | 1 | FALSE | 790 | 250: 11555, 401: 8289, 276: 5175, 428: 4577 |
| max_glu_serum | 0 | 1 | FALSE | 4 | Non: 96420, Nor: 2597, >20: 1485, >30: 1264 |
| A1Cresult | 0 | 1 | FALSE | 4 | Non: 84748, >8: 8216, Nor: 4990, >7: 3812 |
| metformin | 0 | 1 | FALSE | 4 | No: 81778, Ste: 18346, Up: 1067, Dow: 575 |
| repaglinide | 0 | 1 | FALSE | 4 | No: 100227, Ste: 1384, Up: 110, Dow: 45 |
| nateglinide | 0 | 1 | FALSE | 4 | No: 101063, Ste: 668, Up: 24, Dow: 11 |
| chlorpropamide | 0 | 1 | FALSE | 4 | No: 101680, Ste: 79, Up: 6, Dow: 1 |
| glimepiride | 0 | 1 | FALSE | 4 | No: 96575, Ste: 4670, Up: 327, Dow: 194 |
| acetohexamide | 0 | 1 | FALSE | 2 | No: 101765, Ste: 1 |
| glipizide | 0 | 1 | FALSE | 4 | No: 89080, Ste: 11356, Up: 770, Dow: 560 |
| glyburide | 0 | 1 | FALSE | 4 | No: 91116, Ste: 9274, Up: 812, Dow: 564 |
| tolbutamide | 0 | 1 | FALSE | 2 | No: 101743, Ste: 23 |
| pioglitazone | 0 | 1 | FALSE | 4 | No: 94438, Ste: 6976, Up: 234, Dow: 118 |
| rosiglitazone | 0 | 1 | FALSE | 4 | No: 95401, Ste: 6100, Up: 178, Dow: 87 |
| acarbose | 0 | 1 | FALSE | 4 | No: 101458, Ste: 295, Up: 10, Dow: 3 |
| miglitol | 0 | 1 | FALSE | 4 | No: 101728, Ste: 31, Dow: 5, Up: 2 |
| troglitazone | 0 | 1 | FALSE | 2 | No: 101763, Ste: 3 |
| tolazamide | 0 | 1 | FALSE | 3 | No: 101727, Ste: 38, Up: 1 |
| examide | 0 | 1 | FALSE | 1 | No: 101766 |
| citoglipton | 0 | 1 | FALSE | 1 | No: 101766 |
| insulin | 0 | 1 | FALSE | 4 | No: 47383, Ste: 30849, Dow: 12218, Up: 11316 |
| glyburide.metformin | 0 | 1 | FALSE | 4 | No: 101060, Ste: 692, Up: 8, Dow: 6 |
| glipizide.metformin | 0 | 1 | FALSE | 2 | No: 101753, Ste: 13 |
| glimepiride.pioglitazone | 0 | 1 | FALSE | 2 | No: 101765, Ste: 1 |
| metformin.rosiglitazone | 0 | 1 | FALSE | 2 | No: 101764, Ste: 2 |
| metformin.pioglitazone | 0 | 1 | FALSE | 2 | No: 101765, Ste: 1 |
| change | 0 | 1 | FALSE | 2 | No: 54755, Ch: 47011 |
| diabetesMed | 0 | 1 | FALSE | 2 | Yes: 78363, No: 23403 |
| readmitted | 0 | 1 | FALSE | 3 | NO: 54864, >30: 35545, <30: 11357 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| encounter_id | 0 | 1 | 165201645.62 | 102640295.98 | 12522 | 84961194 | 152388987 | 230270888 | 443867222 | ▆▇▅▂▂ |
| patient_nbr | 0 | 1 | 54330400.69 | 38696359.35 | 135 | 23413221 | 45505143 | 87545950 | 189502619 | ▇▆▆▁▁ |
| admission_type_id | 0 | 1 | 2.02 | 1.45 | 1 | 1 | 1 | 3 | 8 | ▇▂▁▁▁ |
| discharge_disposition_id | 0 | 1 | 3.72 | 5.28 | 1 | 1 | 1 | 4 | 28 | ▇▁▁▁▁ |
| admission_source_id | 0 | 1 | 5.75 | 4.06 | 1 | 1 | 7 | 7 | 25 | ▅▇▁▁▁ |
| time_in_hospital | 0 | 1 | 4.40 | 2.99 | 1 | 2 | 4 | 6 | 14 | ▇▅▂▁▁ |
| num_lab_procedures | 0 | 1 | 43.10 | 19.67 | 1 | 31 | 44 | 57 | 132 | ▃▇▅▁▁ |
| num_procedures | 0 | 1 | 1.34 | 1.71 | 0 | 0 | 1 | 2 | 6 | ▇▂▁▁▁ |
| num_medications | 0 | 1 | 16.02 | 8.13 | 1 | 10 | 15 | 20 | 81 | ▇▃▁▁▁ |
| number_outpatient | 0 | 1 | 0.37 | 1.27 | 0 | 0 | 0 | 0 | 42 | ▇▁▁▁▁ |
| number_emergency | 0 | 1 | 0.20 | 0.93 | 0 | 0 | 0 | 0 | 76 | ▇▁▁▁▁ |
| number_inpatient | 0 | 1 | 0.64 | 1.26 | 0 | 0 | 0 | 1 | 21 | ▇▁▁▁▁ |
| number_diagnoses | 0 | 1 | 7.42 | 1.93 | 1 | 6 | 8 | 9 | 16 | ▁▅▇▁▁ |
# Converting variables with categorical values into factors
mData$admission_type_id <- as.factor(mData$admission_type_id)
mData$discharge_disposition_id <- as.factor(mData$discharge_disposition_id)
mData$admission_source_id <- as.factor(mData$admission_source_id)
# Replacing instances of variables where there is "?" or "Unknown/Invalid"
count <- 0
for(i in 1:ncol(mData)){
if(is.factor(mData[,i])){
for(j in 1:nrow(mData)){
if(mData[j,i]== "?" | mData[j,i]== "Unknown/Invalid" ){
count <- count + 1
mData[j,i] <- NA
}
}
if(count > 0){
print(c(colnames(mData)[i],count))
}
}
count <- 0
}
## [1] "race" "2273"
## [1] "gender" "3"
## [1] "weight" "98569"
## [1] "payer_code" "40256"
## [1] "medical_specialty" "49949"
## [1] "diag_1" "21"
## [1] "diag_2" "358"
## [1] "diag_3" "1423"
dim(mData)
## [1] 101766 50
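The cell-by-cell loop above visits every cell of every factor column, which can be slow on roughly 100,000 rows. A minimal vectorized sketch of the same replacement follows, assuming the placeholders are exactly "?" and "Unknown/Invalid". The helper name `mark_missing` and the toy data frame are hypothetical and are not part of the original analysis.

```r
# Hypothetical helper: turn placeholder strings in factor columns into NA
# and report how many cells were replaced in each affected column.
mark_missing <- function(df, placeholders = c("?", "Unknown/Invalid")) {
  counts <- integer(0)
  for (col in names(df)) {
    if (is.factor(df[[col]])) {
      hit <- df[[col]] %in% placeholders  # %in% yields FALSE (never NA) for NA cells
      if (any(hit)) {
        df[[col]][hit] <- NA
        counts[col] <- sum(hit)
      }
    }
  }
  list(data = df, counts = counts)
}

# Toy data, invented for illustration only
toy <- data.frame(
  race   = factor(c("Caucasian", "?", "Asian")),
  gender = factor(c("Female", "Unknown/Invalid", "Male")),
  visits = c(1L, 2L, 3L)
)
res <- mark_missing(toy)
print(res$counts)                 # replaced cells per column
print(colMeans(is.na(res$data)))  # share of missing values per column
```

As in the loop, the placeholder strings remain as unused factor levels afterwards; `droplevels()` would remove them if that matters downstream.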
# Heat map to see missing data of variables
heatmap(1 * is.na(mData), Rowv = NA, Colv = NA)
mData$x <- NULL # Removing the empty first column
mData$medical_specialty <- NULL # Could be kept to show some classification, but dropped as it is of no use for our analysis
mData$weight <- NULL # Removing weight, which is missing for most records (98,569 of 101,766)
mData$encounter_id <- NULL # Not needed, as we won't analyze anything from it
mData$payer_code <- NULL # Removing payer_code, which is missing for 40,256 records
mData$examide <- NULL # Constant: has only one value
mData$citoglipton <- NULL # Constant: has only one value
# mData[complete.cases(mData), ] # Displays all instances that have complete data
# mData[!complete.cases(mData), ] # Displays all instances that have NA in any variable
mDatao <- na.omit(mData) # Omitting all the instances where values are NA
dim(mDatao) # Updated dimension of data
## [1] 98052 44
str(mDatao) # Updated Data statistics
## 'data.frame': 98052 obs. of 44 variables:
## $ patient_nbr : int 55629189 86047875 82442376 42519267 82637451 84259809 114882984 48330783 63555939 89869032 ...
## $ race : Factor w/ 6 levels "?","AfricanAmerican",..: 4 2 4 4 4 4 4 4 4 2 ...
## $ gender : Factor w/ 3 levels "Female","Male",..: 1 1 2 2 2 2 2 1 1 1 ...
## $ age : Factor w/ 10 levels "[0-10)","[10-20)",..: 2 3 4 5 6 7 8 9 10 5 ...
## $ admission_type_id : Factor w/ 8 levels "1","2","3","4",..: 1 1 1 1 2 3 1 2 3 1 ...
## $ discharge_disposition_id: Factor w/ 26 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 3 1 ...
## $ admission_source_id : Factor w/ 17 levels "1","2","3","4",..: 7 7 7 7 2 2 7 4 4 7 ...
## $ time_in_hospital : int 3 2 2 1 3 4 5 13 12 9 ...
## $ num_lab_procedures : int 59 11 44 51 31 70 73 68 33 47 ...
## $ num_procedures : int 0 5 1 0 6 1 0 2 3 2 ...
## $ num_medications : int 18 13 16 8 16 21 12 28 18 17 ...
## $ number_outpatient : int 0 2 0 0 0 0 0 0 0 0 ...
## $ number_emergency : int 0 0 0 0 0 0 0 0 0 0 ...
## $ number_inpatient : int 0 1 0 0 0 0 0 0 0 0 ...
## $ diag_1 : Factor w/ 717 levels "?","10","11",..: 145 456 556 56 265 265 278 254 284 122 ...
## $ diag_2 : Factor w/ 749 levels "?","11","110",..: 81 80 99 26 248 248 316 262 48 243 ...
## $ diag_3 : Factor w/ 790 levels "?","11","110",..: 123 768 250 88 88 772 88 231 319 668 ...
## $ number_diagnoses : int 9 6 7 5 9 7 8 8 8 9 ...
## $ max_glu_serum : Factor w/ 4 levels ">200",">300",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ A1Cresult : Factor w/ 4 levels ">7",">8","None",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ metformin : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 3 2 2 2 2 ...
## $ repaglinide : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ nateglinide : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ chlorpropamide : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ glimepiride : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 3 2 2 2 2 ...
## $ acetohexamide : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
## $ glipizide : Factor w/ 4 levels "Down","No","Steady",..: 2 3 2 3 2 2 2 3 2 2 ...
## $ glyburide : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 3 2 2 2 ...
## $ tolbutamide : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
## $ pioglitazone : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ rosiglitazone : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 3 2 ...
## $ acarbose : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ miglitol : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ troglitazone : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
## $ tolazamide : Factor w/ 3 levels "No","Steady",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ insulin : Factor w/ 4 levels "Down","No","Steady",..: 4 2 4 3 3 3 2 3 3 3 ...
## $ glyburide.metformin : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ glipizide.metformin : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
## $ glimepiride.pioglitazone: Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
## $ metformin.rosiglitazone : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
## $ metformin.pioglitazone : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
## $ change : Factor w/ 2 levels "Ch","No": 1 2 1 1 2 1 2 1 1 2 ...
## $ diabetesMed : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ readmitted : Factor w/ 3 levels "<30",">30","NO": 2 3 3 3 2 3 2 3 3 2 ...
## - attr(*, "na.action")= 'omit' Named int 1 20 21 22 55 66 67 88 100 112 ...
## ..- attr(*, "names")= chr "1" "20" "21" "22" ...
# library(heatmaply)
# heatmaply_na(mDatao,showticklabels = c(TRUE, FALSE))
#
# round(cor(mDatao),2)%>%
# knitr::kable(caption = "", digits = 3)%>%
# kableExtra::kable_styling(bootstrap_options = "striped", full_width = FALSE,position = "left")%>%
# row_spec(0, bold = T)%>%
# column_spec(1,bold = TRUE, italic = TRUE)
#The variable discharge_disposition_id tells us where the patient went after being discharged from the hospital. Codes 11, 13, 14, 19, 20 and 21 relate to death or hospice, so we remove those records because these patients cannot be readmitted.
par(mfrow = c(1,2))
barplot(table(mDatao$discharge_disposition_id), main = "Before dropping")
mDatao <- mDatao[!mDatao$discharge_disposition_id %in% c(11,13,14,19,20,21), ]
barplot(table(mDatao$discharge_disposition_id), main = "After dropping")
#I am renaming admission_type_id to admission_type and then I am going to collapse their factors and club some of them together as they are similar
colnames(mDatao)[5] <- "admission_type"
barplot(table(mDatao$admission_type))
mDatao$admission_type <- replace(mDatao$admission_type,mDatao$admission_type == 2, 1)
mDatao$admission_type <- replace(mDatao$admission_type,mDatao$admission_type == 8, 5)
mDatao$admission_type <- replace(mDatao$admission_type,mDatao$admission_type == 6, 5)
mDatao$admission_type <- replace(mDatao$admission_type,mDatao$admission_type == 7, 1)
barplot(table(mDatao$admission_type), main = "Admission types after data collapsing")
#I am changing name of factors in the variable for better understanding
mDatao$admission_type <- str_replace(mDatao$admission_type,"1","Emergency")
mDatao$admission_type <- str_replace(mDatao$admission_type,"5","Other")
mDatao$admission_type <- str_replace(mDatao$admission_type,"3","Elective")
mDatao$admission_type <- str_replace(mDatao$admission_type,"4","Newborn")
mDatao$admission_type <- as.factor(mDatao$admission_type)
barplot(table(mDatao$admission_type))
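The collapse-then-rename steps above can also be written as a single assignment. The sketch below uses base R's `levels<-` with a named list, which merges several old levels into one new level. The toy vector is invented for illustration, and the grouping simply mirrors the one used above (2 and 7 merged into Emergency, 6 and 8 merged into Other).

```r
# Toy admission type codes, using the same coding as the data dictionary
admission_type <- factor(c("1", "2", "3", "4", "5", "6", "7", "8", "1"))

# Assigning a named list to levels() merges the listed old levels
# into the new level given by each list name.
levels(admission_type) <- list(
  Emergency = c("1", "2", "7"),
  Elective  = "3",
  Newborn   = "4",
  Other     = c("5", "6", "8")
)
print(table(admission_type))
```

One difference to be aware of: the resulting level order follows the list, whereas `as.factor()` on a character vector orders the levels alphabetically.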
#I am renaming variable "admission_source_id" to "admission_source"
colnames(mDatao)[7] <- "admission_source"
barplot(table(mDatao$admission_source))
#I am grouping/collapsing the factors of variables based on their similar nature
mDatao$admission_source <- case_when(mDatao$admission_source %in% c("1","2","3") ~ "Physician Referral",mDatao$admission_source %in% c("4","5","6","8","9","10","11","12","13","14","15","17","18","19","20","21","22","23","24","25","26")~"Other",TRUE~"Emergency Room")
mDatao$admission_source <- as.factor(mDatao$admission_source)
barplot(table(mDatao$admission_source), main = "Post collapsing and changing type of admission")
#I am renaming the column "discharge_disposition_id" to "discharge_disposition"
colnames(mDatao)[6] <- "discharge_disposition"
barplot(table(mDatao$discharge_disposition))
#collapsing some other variables and grouping according to convenience
mDatao$discharge_disposition <- case_when(mDatao$discharge_disposition %in% "1" ~ "Home", TRUE ~ "Other")
mDatao$discharge_disposition <- as.factor(mDatao$discharge_disposition)
barplot(table(mDatao$discharge_disposition), main = "After collapsing and changing the type")
mDatao$diag_1 <- as.character(mDatao$diag_1)
# All the diagnoses variables values are present in ICD-9 codes format, based on which I am grouping them according to the type of the problem found in diagnoses
mDatao<- mutate(mDatao, primary_diagnosis = ifelse(str_detect(diag_1, "V") | str_detect(diag_1, "E"),"Other",
ifelse(str_detect(diag_1, "250"), "Diabetes",
ifelse((as.integer(diag_1) >= 390 & as.integer(diag_1) <= 459) | as.integer(diag_1) == 785, "Circulatory",
ifelse((as.integer(diag_1) >= 460 & as.integer(diag_1) <= 519) | as.integer(diag_1) == 786, "Respiratory",
ifelse((as.integer(diag_1) >= 520 & as.integer(diag_1) <= 579) | as.integer(diag_1) == 787, "Digestive",
ifelse((as.integer(diag_1) >= 580 & as.integer(diag_1) <= 629) | as.integer(diag_1) == 788, "Genitourinary",
ifelse((as.integer(diag_1) >= 140 & as.integer(diag_1) <= 239), "Neoplasms", ifelse((as.integer(diag_1) >= 710 & as.integer(diag_1) <= 739), "Musculoskeletal", ifelse((as.integer(diag_1) >= 800 & as.integer(diag_1) <= 999), "Injury", "Other"))))))))))
## Warning: Problem with `mutate()` input `primary_diagnosis`.
## i NAs introduced by coercion
## i Input `primary_diagnosis` is `ifelse(...)`.
## Warning in ifelse((as.integer(diag_1) >= 390 & as.integer(diag_1) <= 459) | :
## NAs introduced by coercion
mDatao$primary_diagnosis <- as.factor(mDatao$primary_diagnosis)
table(mDatao$primary_diagnosis)
##
## Circulatory Diabetes Digestive Genitourinary Injury
## 28887 7870 9045 4870 6590
## Musculoskeletal Neoplasms Other Respiratory
## 4717 3013 17169 13511
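The "NAs introduced by coercion" warnings come from calling `as.integer()` on codes that start with V or E. A minimal sketch of the same grouping without nested `ifelse()` calls is shown below. The helper name `group_icd9` and the sample codes are hypothetical. It also assumes that diabetes is identified by an integer part of 250 rather than by the substring "250", which should be equivalent for ICD-9 codes of the form `250.xx`.

```r
# Hypothetical helper: map ICD-9 codes (as text) to the diagnosis groups used above
group_icd9 <- function(code) {
  code  <- as.character(code)
  is_ve <- grepl("^[VE]", code)                      # supplementary V and E codes
  num   <- suppressWarnings(floor(as.numeric(code))) # NA for non-numeric codes
  between <- function(lo, hi) !is_ve & !is.na(num) & num >= lo & num <= hi

  out <- rep("Other", length(code))
  out[between(250, 250)]                     <- "Diabetes"
  out[between(390, 459) | between(785, 785)] <- "Circulatory"
  out[between(460, 519) | between(786, 786)] <- "Respiratory"
  out[between(520, 579) | between(787, 787)] <- "Digestive"
  out[between(580, 629) | between(788, 788)] <- "Genitourinary"
  out[between(140, 239)]                     <- "Neoplasms"
  out[between(710, 739)]                     <- "Musculoskeletal"
  out[between(800, 999)]                     <- "Injury"
  out
}

# Sample codes, chosen for illustration only
codes  <- c("250.83", "428", "786", "V57", "715", "820", "162", "574", "599", "38")
groups <- group_icd9(codes)
print(data.frame(code = codes, group = groups))
```

Because the ranges do not overlap, the order of the assignments does not affect the result.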
#removing "diag variables"
mDatao$diag_1 <- NULL
mDatao$diag_2 <- NULL
mDatao$diag_3 <- NULL
barplot(table(mDatao$age))
#I am regrouping the "age" to [0-40],[40-50],[50-60],[60-70],[70-80],[80-100]
mDatao$age <- case_when(mDatao$age %in% c("[0-10)","[10-20)","[20-30)","[30-40)") ~ "[0-40]",
mDatao$age %in% c("[80-90)","[90-100)") ~ "[80-100]",
mDatao$age %in% "[40-50)" ~ "[40-50]",
mDatao$age %in% "[50-60)" ~ "[50-60]",
mDatao$age %in% "[60-70)" ~ "[60-70]", TRUE ~ "[70-80]")
barplot(table(mDatao$age), main = "Regrouped Age")
mDatao$age <- as.factor(mDatao$age)
#Now I am categorizing "readmitted" variable to 1 -if the patient was readmitted within 30 days, 0 -if the readmission was after 30 days or there is no readmission
mDatao$readmitted <- case_when(mDatao$readmitted %in% c(">30","NO") ~ "0", TRUE ~ "1")
mDatao$readmitted <- as.factor(mDatao$readmitted)
levels(mDatao$readmitted)
## [1] "0" "1"
#For patients with multiple encounters, I am keeping only the first record and removing the rest
mDatao <- mDatao[!duplicated(mDatao$patient_nbr),]
#Now I am also removing "patient_nbr"
mDatao$patient_nbr <- NULL
dim(mDatao)
## [1] 67128 41
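To make the effect of `!duplicated()` concrete, here is a small sketch on invented data. `duplicated()` marks the second and later occurrences of each value, so the filter keeps the first row seen for each patient. Whether that row is the earliest encounter in time depends on the row order of the data, which is an assumption here.

```r
# Toy encounter table, invented for illustration only
visits <- data.frame(
  patient_nbr = c(101L, 102L, 101L, 103L, 102L),
  readmitted  = c("0", "1", "1", "0", "0")
)

# Keep only the first row seen for each patient number
first_only <- visits[!duplicated(visits$patient_nbr), ]
print(first_only)
```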
#I am identifying the variables that have outliers and removing them
par(mfrow = c(2,4))
boxplot(mDatao$time_in_hospital, main = "time_in_hospital")
boxplot(mDatao$number_outpatient, main = "number_outpatient")
boxplot(mDatao$number_emergency, main = "number_emergency")
boxplot(mDatao$num_lab_procedures, main = "num_lab_procedures")
boxplot(mDatao$number_diagnoses, main = "number_diagnoses")
boxplot(mDatao$number_inpatient, main = "number_inpatient")
boxplot(mDatao$num_procedures, main = "num_procedures")
boxplot(mDatao$num_medications, main = "num_medications")
#These three variables are mostly zero with a few extreme values, hence removing them
mDatao$number_emergency <- NULL
mDatao$number_inpatient <- NULL
mDatao$number_outpatient <- NULL
#Removing every row that has an outlier (outside Q1 - 1.5*IQR .. Q3 + 1.5*IQR) in any integer column
outliers_remover <- function(df){
  drop <- rep(FALSE, nrow(df)) # TRUE marks a row to be removed
  for(i in seq_len(ncol(df))){
    x <- df[[i]]
    if(is.integer(x)){
      q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
      iqr <- q[2] - q[1]
      upper <- q[2] + 1.5 * iqr
      lower <- q[1] - 1.5 * iqr
      # which() skips NA comparisons, so missing values are never flagged
      drop[which(x > upper | x < lower)] <- TRUE
    }
  }
  # Logical indexing also works when no row is flagged (df[-c(), ] would return zero rows)
  df[!drop, , drop = FALSE]
}
mDatao <- outliers_remover(mDatao)
pairs.panels(mDatao[c("time_in_hospital", "num_lab_procedures", "num_procedures", "num_medications", "number_diagnoses")])
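As a worked example of the 1.5 × IQR rule used for outlier removal, the sketch below computes the fences for a small invented vector. The values are illustrative and are not taken from the data set.

```r
# Toy values, e.g. days in hospital, invented for illustration only
x <- c(1L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 14L)

q     <- quantile(x, c(0.25, 0.75))  # Q1 = 2, Q3 = 4 for this vector
iqr   <- q[[2]] - q[[1]]             # 2
lower <- q[[1]] - 1.5 * iqr          # 2 - 3 = -1
upper <- q[[2]] + 1.5 * iqr          # 4 + 3 = 7

outliers <- x[x < lower | x > upper]
print(outliers)                      # 14 is the only value outside [-1, 7]
```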
table(mDatao$readmitted)
##
## 0 1
## 55628 5559
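The table shows a strong class imbalance. A short sketch of the arithmetic, using only the two counts printed above: about 9% of the remaining records are readmissions within 30 days, so a model that always predicts "0" would already be right about 91% of the time. The variable names are hypothetical.

```r
# Class counts copied from the table above
counts <- c("0" = 55628, "1" = 5559)

share <- counts / sum(counts)  # proportion of each class
print(round(share, 4))

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy <- max(share)
print(round(baseline_accuracy, 4))
```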
mDatao$repaglinide <- NULL
mDatao$nateglinide <- NULL
mDatao$chlorpropamide <-NULL
mDatao$acetohexamide <- NULL
mDatao$tolbutamide <- NULL
mDatao$acarbose <- NULL
mDatao$miglitol <- NULL
mDatao$troglitazone <- NULL
mDatao$tolazamide <- NULL
mDatao$glyburide.metformin <- NULL
mDatao$glipizide.metformin <- NULL
mDatao$glimepiride.pioglitazone <- NULL
mDatao$metformin.rosiglitazone <- NULL
mDatao$metformin.pioglitazone <- NULL
dim(mDatao)
## [1] 61187 24
Medication features that can be removed, as suggested by Boruta
Several techniques in R, such as Boruta and MARS, help identify important variables (reference: http://r-statistics.co/Variable-Selection-and-Importance-With-R.html).
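Boruta judges each variable against "shadow" copies of the predictors whose values have been shuffled, so that they carry no information about the response. The sketch below illustrates only that idea in base R. It is not the Boruta algorithm: it uses absolute correlation in place of random forest importance, runs a single round instead of repeated statistical tests, and works on simulated data with hypothetical names.

```r
set.seed(1)
n <- 500
signal <- rnorm(n)              # predictor that really drives the response
noise  <- rnorm(n)              # predictor unrelated to the response
y      <- 2 * signal + rnorm(n)
features <- data.frame(signal = signal, noise = noise)

# Shadow features: each column shuffled independently, which breaks any link to y
shadows <- as.data.frame(lapply(features, sample))
names(shadows) <- paste0("shadow_", names(features))

# Stand-in importance measure (an assumption of this sketch): |correlation with y|
importance <- function(x) abs(cor(x, y))
feat_imp   <- sapply(features, importance)
shadow_imp <- sapply(shadows, importance)

# Keep only features that beat the best shadow feature
kept <- names(feat_imp)[feat_imp > max(shadow_imp)]
print(kept)
```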
# ensure results are repeatable
set.seed(100)
boruta <- Boruta(readmitted ~., data = mDatao, doTrace = 2)
## After 12 iterations, +15 mins:
## confirmed 16 attributes: A1Cresult, admission_source, admission_type, age, change and 11 more;
## rejected 2 attributes: glimepiride, glyburide;
## still have 5 attributes left.
## After 16 iterations, +20 mins:
## rejected 2 attributes: glipizide, rosiglitazone;
## still have 3 attributes left.
## After 26 iterations, +29 mins:
## confirmed 1 attribute: diabetesMed;
## still have 2 attributes left.
## After 34 iterations, +37 mins:
## rejected 1 attribute: pioglitazone;
## still have 1 attribute left.
## After 51 iterations, +53 mins:
## confirmed 1 attribute: gender;
## no more attributes left.
plot(boruta, las = 2, cex.axis = 0.5)
plotImpHistory(boruta)
attStats(boruta)
##                          meanImp  medianImp     minImp     maxImp   normHits decision
## race                   5.9473145  5.8424548  3.7600833  8.0492192 1.00000000 Confirmed
## gender                 2.7864856  2.6946927  0.5195113  5.2730131 0.74509804 Confirmed
## age                   22.5000512 22.5729643 18.3341191 26.2400991 1.00000000 Confirmed
## admission_type        23.0409482 23.2776984 18.4581737 26.1686836 1.00000000 Confirmed
## discharge_disposition 29.5168999 29.5817939 25.4113431 34.4669588 1.00000000 Confirmed
## admission_source      24.8510776 24.8877501 21.4608738 28.3055906 1.00000000 Confirmed
## time_in_hospital      30.8222395 30.9236807 26.6336428 35.4899781 1.00000000 Confirmed
## num_lab_procedures    26.9177497 27.2026971 22.9035450 30.1815857 1.00000000 Confirmed
## num_procedures        20.1157786 19.7861094 16.8729665 23.5431608 1.00000000 Confirmed
## num_medications       33.1378005 33.1587865 29.0765147 37.1821540 1.00000000 Confirmed
## number_diagnoses      22.9835165 23.1118335 19.3429959 27.9644279 1.00000000 Confirmed
## max_glu_serum         15.8074993 15.8644383 12.7864356 17.8362448 1.00000000 Confirmed
## A1Cresult             11.6920435 11.5936744  9.9718235 13.7780133 1.00000000 Confirmed
## metformin              8.5465352  8.6001694  6.7347819 10.3895841 1.00000000 Confirmed
## glimepiride           -0.8051658 -0.9638350 -2.5045546  0.8300808 0.00000000 Rejected
## glipizide              0.8175565  0.9509885 -0.6958722  2.0485499 0.01960784 Rejected
## glyburide             -2.6147356 -2.3578022 -4.9575220 -1.3527214 0.00000000 Rejected
## pioglitazone           0.8642983  0.7435128 -1.6428089  3.3264061 0.13725490 Rejected
## rosiglitazone          0.3221866  0.3920429 -1.6624047  1.9048797 0.01960784 Rejected
## insulin                6.7471171  7.0152141  4.0320129  9.2008328 1.00000000 Confirmed
## change                13.3461142 12.9683995 11.1588505 16.1669477 1.00000000 Confirmed
## diabetesMed            7.4033560  8.3950637  0.4972511 11.4332937 0.92156863 Confirmed
## primary_diagnosis     17.7373911 17.6131008 15.8138216 21.0660926 1.00000000 Confirmed
boruta
## Boruta performed 51 iterations in 53.09086 mins.
## 18 attributes confirmed important: A1Cresult, admission_source,
## admission_type, age, change and 13 more;
## 5 attributes confirmed unimportant: glimepiride, glipizide, glyburide,
## pioglitazone, rosiglitazone;
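The confirmed attributes can be pulled out programmatically for the modelling step. On a fitted object, Boruta offers `getSelectedAttributes()` for this. The sketch below does the equivalent filtering in base R on a few rows copied by hand from the `attStats()` output above, so it runs without refitting. The names `imp_stats`, `keep` and `f` are illustrative and not part of the original analysis.

```r
# A few rows of the attStats(boruta) output above, copied by hand for illustration.
imp_stats <- data.frame(
  meanImp  = c(33.1378005, 30.8222395, 2.7864856, 0.8175565, -2.6147356),
  decision = c("Confirmed", "Confirmed", "Confirmed", "Rejected", "Rejected"),
  row.names = c("num_medications", "time_in_hospital", "gender",
                "glipizide", "glyburide")
)

# Keep only the confirmed attributes, ordered by mean importance (highest first).
confirmed <- imp_stats[imp_stats$decision == "Confirmed", , drop = FALSE]
confirmed <- confirmed[order(-confirmed$meanImp), , drop = FALSE]
keep <- rownames(confirmed)

# Build a model formula with readmitted as the response.
f <- reformulate(keep, response = "readmitted")
print(keep)
print(f)
```

On the real object, `getSelectedAttributes(boruta, withTentative = FALSE)` should give the same kind of character vector for all 18 confirmed attributes.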
I split the data into 80% for training and 20% for testing.
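# Fix the seed so the partition is reproducible
set.seed(100)
# Row indices for the 80% training partition (createDataPartition() is from caret)
train <- createDataPartition(mDatao$readmitted, p = 0.8, list = FALSE)
training <- mDatao[train, ]
testing <- mDatao[-train, ]
# Check the distribution of the dependent variable in the training set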
table(training$readmitted)
##
## 0 1
## 44503 4448
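The counts above point to a strong class imbalance, which is worth quantifying before modelling. A minimal sketch, with the counts typed in from the `table()` output above (`counts` and `props` are illustrative names):

```r
# Class counts in the training set, copied from the table() output above.
counts <- c(`0` = 44503, `1` = 4448)

# Share of each class in the training set.
props <- counts / sum(counts)
print(round(props, 3))

# Ratio of non-readmitted to readmitted encounters.
print(round(counts[["0"]] / counts[["1"]], 1))
```

Roughly 9% of the training encounters are readmissions, so a model that always predicts 0 would already reach about 91% accuracy. Accuracy alone is therefore a weak yardstick for this problem.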
For validation, I plan to use k-fold cross-validation. I still need to understand how this technique works, so I have not created a separate validation set yet.
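To make the mechanics of k-fold cross-validation concrete, here is a minimal base R sketch. It uses a small simulated data set (`toy`, with hypothetical columns `x` and `y`) and a plain logistic regression as a stand-in model. None of this is the model used in the analysis; it only illustrates how the folds rotate.

```r
set.seed(100)

# Toy data standing in for the real training set (simulated, hypothetical values).
n <- 200
toy <- data.frame(x = rnorm(n))
toy$y <- rbinom(n, 1, plogis(0.8 * toy$x))

k <- 5
# Assign each row to one of k folds at random, with equal fold sizes.
fold_id <- sample(rep(seq_len(k), length.out = n))

acc <- numeric(k)
for (i in seq_len(k)) {
  train_fold <- toy[fold_id != i, ]
  valid_fold <- toy[fold_id == i, ]

  # Fit on the other k-1 folds, then evaluate on the held-out fold.
  fit <- glm(y ~ x, data = train_fold, family = binomial)
  p <- predict(fit, newdata = valid_fold, type = "response")
  acc[i] <- mean((p > 0.5) == (valid_fold$y == 1))
}

# Per-fold accuracy and the cross-validated average.
print(round(acc, 3))
print(mean(acc))
```

Each row is used for validation exactly once and for training k-1 times, which is why no separate validation set is needed. With caret, `trainControl(method = "cv", number = 10)` passed to `train()` takes care of this bookkeeping.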